Acknowledgement
I sincerely thank my parents and family for giving me the support and
opportunity to invest my time in learning Machine Learning and
Artificial Intelligence and to apply them in environmental management work.
Furthermore, I thank the Google Career Certification courses for
providing me the resources to learn Python programming and
Machine Learning concepts.
Use of generative artificial intelligence
Generative artificial intelligence (GenAI) was mainly used for creating
charts and adjusting visualization parameters in Python. GenAI was
also used for code debugging. However, the responses provided by GenAI
were critically judged before being implemented.
Problem Statement
Salifort Motors is a fictional French-based alternative energy vehicle
manufacturer. The HR department at Salifort Motors wants to take some
initiatives to improve employee satisfaction levels at the company. They
refer to you as a data analytics professional and ask you to provide
data-driven suggestions based on your understanding of the data. They
have the following question: what’s likely to make an employee leave
the company?
Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will benefit the company. If the data analyst can predict which employees are likely to quit, it may be possible to identify the main factors that contribute to their leaving.
Project Aim and Focus
Goals in this project are to analyze the data collected by the HR
department and to build a model that predicts whether or not an employee
will leave the company.
Raw data used
This project uses a dataset called HR_capstone_dataset.csv. It
represents 10 columns of self-reported information from employees of a
fictitious multinational vehicle manufacturing corporation.
Methodology
The following methodology was undertaken for this project:
- Raw data - HR_capstone_dataset.csv from the HR department is used to
assess the needs of the senior leadership team.
- The cleaned data set is split into 75% training and 25% test data,
which are used to train the machine learning models and evaluate their
predictions.
- Analyses such as confusion matrices, feature importance, and scoring
metrics are performed to assess the models' performance in predicting
whether an employee will leave and the main factors influencing
employees to quit.
Results
Out of the models, .
Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. Its global workforce of over 100,000 employees research, design, construct, validate, and distribute electric, solar, algae, and hydrogen-based vehicles. Salifort’s end-to-end vertical integration model has made it a global leader at the intersection of alternative energy and automobiles.
The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They refer to the data analytics professional and ask them to provide data-driven suggestions based on their understanding of the data. They have the following question: what’s likely to make an employee leave the company?
For this project, the key stakeholders include the HR department and the senior leadership team, as they are directly involved in employee management and decision-making. The senior leadership team has tasked the data analyst with analyzing the dataset to come up with ideas for how to increase employee retention. To help with this, they would like the analyst to build a machine learning model that predicts whether an employee will leave the company based on their department, number of projects, average monthly hours, and any other data points deemed helpful.
Goals
The primary objective is to identify and predict the underlying drivers
contributing to employee turnover, which can help in formulating
effective retention strategies. Goals in this project are to analyze the
data collected by the HR department and to build a model that predicts
whether or not an employee will leave the company.
Methodology
For this project, the analyst chooses a method to approach this data
challenge, either selecting a regression model or a tree-based machine
learning model to predict whether an employee will leave the company.
The following methodology was undertaken for this project.
This project uses a dataset called HR_capstone_dataset.csv, downloaded
from the Kaggle website. It is provided by the HR department and used to
assess the needs of the senior leadership team. In the EDA, the dataset
is analysed and prepared for building the machine learning models.
First, loading the libraries and packages that are needed for the employee satisfaction prediction project. The selected libraries provide functions for handling data, building and running machine learning models, and visualizing results.
# Import packages
# Operational Packages
import numpy as np
import pandas as pd
import io
import pickle
# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML
from IPython.display import display, Markdown
from tabulate import tabulate
# Modelling packages
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#XGBoost
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
# Modelling evaluation and metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree
from sklearn.tree import export_text
To start the project, loading the dataset
HR_capstone_dataset.csv and analysing its basic
structure. The dataset represents 10 columns of self-reported information
from employees of a fictitious multinational vehicle manufacturing
corporation.
# Load dataset into a dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# Load CSV
df0 = pd.read_csv(r"D:\Study\Machine Learning\Projects\R-Git\Completed projects for GitHub\Predicting-the-employee-satisfaction-levels-at-Salifort-Motors\Data\HR_capstone_dataset.csv")
# Format first 5 rows like a kable table
df0.head().style.set_table_attributes("class='table table-sm'")
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.380000 | 0.530000 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.800000 | 0.860000 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.110000 | 0.880000 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.720000 | 0.870000 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.370000 | 0.520000 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
In this step, gaining a comprehensive understanding of the data set and preparing it for modelling is essential. This involves reviewing all variables to understand their data types, statistical distributions, and relevance to the target objective.
# Gather basic information about the data
# Create a StringIO buffer
buffer = io.StringIO()
# Capture the output of df.info() into the buffer
df0.info(buf=buffer)
# Get the content from the buffer
info_str = buffer.getvalue()
# Print the content
display(Markdown(f"```\n{info_str}\n```"))
# Print the descriptive statistics
df0.describe().style.set_table_attributes("class='table table-sm'")
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
The HR_capstone_dataset.csv dataset contains 14999 rows
and 10 columns, of which 2 are floats, 6 are integers and 2
are objects. Upon initial exploration of the data set, most of the
variables align with the prediction task, but certain
variables can be engineered for more effective predictions. An ethical
consideration at this point is possible bias in the
recorded data, which must be kept in mind both during the analysis and
while interpreting and presenting the results to ensure fairness and
accuracy.
Descriptive statistics of the dataset are shown above.
In this step, the HR_capstone_dataset.csv dataset is
cleaned by addressing missing values, removing redundant or
duplicate entries, and identifying any anomalies or inconsistencies.
Outliers that could potentially distort model performance are also
detected and evaluated for appropriate handling. These steps ensure
that the dataset is accurate, consistent, and ready for further
analysis, laying a solid foundation for building reliable predictive
models.
Rename columns
As a data cleaning step, rename the columns as needed: standardize
the column names so that they are all in snake_case,
correct any column names that are misspelled, and make column
names more concise where possible.
# Display all column names
list(df0.columns)
## ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'Department', 'salary']
# Rename columns as needed
df = df0.copy()
df = df0.rename(columns={'satisfaction_level':'satisfaction',
'last_evaluation':'last_eval',
'number_project':'#_projects',
'average_montly_hours':'avg_mon_hrs',
'time_spend_company':'tenure',
'Work_accident':'work_accident',
'promotion_last_5years':'promotion_<5yrs',
'Department':'department'
})
# Display all column names after the update
list(df.columns)
## ['satisfaction', 'last_eval', '#_projects', 'avg_mon_hrs', 'tenure', 'work_accident', 'left', 'promotion_<5yrs', 'department', 'salary']
Check missing values
Checking for any missing values in the data.
# Check for missing values
df.isnull().sum().reset_index().style.set_table_attributes("class='table table-sm'")
| | index | 0 |
|---|---|---|
| 0 | satisfaction | 0 |
| 1 | last_eval | 0 |
| 2 | #_projects | 0 |
| 3 | avg_mon_hrs | 0 |
| 4 | tenure | 0 |
| 5 | work_accident | 0 |
| 6 | left | 0 |
| 7 | promotion_<5yrs | 0 |
| 8 | department | 0 |
| 9 | salary | 0 |
There appear to be no missing values in this dataset.
Check duplicates
Checking for any duplicate entries in the data.
# Check for duplicates
df.duplicated().sum()
## np.int64(3008)
# Inspect some rows containing duplicates as needed
df[df.duplicated()].head().style.set_table_attributes("class='table table-sm'")
| | satisfaction | last_eval | #_projects | avg_mon_hrs | tenure | work_accident | left | promotion_<5yrs | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 396 | 0.460000 | 0.570000 | 2 | 139 | 3 | 0 | 1 | 0 | sales | low |
| 866 | 0.410000 | 0.460000 | 2 | 128 | 3 | 0 | 1 | 0 | accounting | low |
| 1317 | 0.370000 | 0.510000 | 2 | 127 | 3 | 0 | 1 | 0 | sales | medium |
| 1368 | 0.410000 | 0.520000 | 2 | 132 | 3 | 0 | 1 | 0 | RandD | low |
| 1461 | 0.420000 | 0.530000 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |
# Drop duplicates and save resulting dataframe in a new variable as needed
df1 = df.drop_duplicates(keep='first')
# Display first few rows of new dataframe as needed
df1.head().style.set_table_attributes("class='table table-sm'")
| | satisfaction | last_eval | #_projects | avg_mon_hrs | tenure | work_accident | left | promotion_<5yrs | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.380000 | 0.530000 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.800000 | 0.860000 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.110000 | 0.880000 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.720000 | 0.870000 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.370000 | 0.520000 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Because the dataset contains several continuous variables across all 10 columns, it is very unlikely that distinct employees would share identical values in every column; these rows are therefore almost certainly true duplicates, and dropping them will help the models make accurate predictions.
Check outliers
Checking for outliers in the data. Certain types of models are more sensitive to outliers than others, so whether to remove them depends on the type of models that will be used in the project.
# Create a boxplot to visualize distribution of `tenure` and detect any outliers
plt.figure(figsize=(16,6))
plt.title('Detecting outliers for tenure (Boxplot)', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=df1['tenure'])
plt.show()
The box plot shows that there are outliers in the tenure
column. So, checking how many rows contain outliers in the
tenure column.
# Determine the number of rows containing outliers
# 25th Percentile for tenure
percentile25 = df1['tenure'].quantile(0.25)
# 75th Percentile for tenure
percentile75 = df1['tenure'].quantile(0.75)
# IQR - Inter Quartile Range
iqr = percentile75 - percentile25
# Limits of the tenure
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print('Lower limit:', lower_limit)
## Lower limit: 1.5
print('Upper limit:', upper_limit)
## Upper limit: 5.5
# Identifying the outliers in 'tenure'
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
# print the rows containing the outliers
print(f'Number of rows containing outliers in tenure:', len(outliers))
## Number of rows containing outliers in tenure: 824
Outlier analysis in the tenure column indicates that employees with less than 1.5 years or more than 5.5 years of tenure show notable deviations, with 824 rows flagged as outliers. Most employees tend to leave within 5 years, possibly due to lack of advancement opportunities.
Beginning by understanding how many employees left and what percentage of all employees this figure represents.
# Get numbers of people who left vs. stayed
print(df['left'].value_counts())
## left
## 0 11428
## 1 3571
## Name: count, dtype: int64
print()
# Get percentages of people who left vs. stayed
df['left'].value_counts(normalize=True)
## left
## 0 0.761917
## 1 0.238083
## Name: proportion, dtype: float64
Examining variables that are relevant to the project and creating plots to visualize relationships between variables in the data.
# Select only numeric columns
numeric_df = df1.select_dtypes(include=['number'])
# Plot a correlation heatmap
plt.figure(figsize=(20, 12))
heatmap = sns.heatmap(
numeric_df.corr(),
vmin=-1,
vmax=1,
annot=True,
fmt=".2f", # optional: format annotation
annot_kws={"size": 12}, # ← font size of annotation inside heatmap
cmap=sns.color_palette("vlag", as_cmap=True),
cbar_kws={"shrink": 0.75, "label": "Correlation"} # optional: color bar label
)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':20}, pad=20);
# Set x and y tick labels size
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=45, ha='right', fontsize=14)
heatmap.set_yticklabels(heatmap.get_yticklabels(), fontsize=14)
plt.tight_layout()
plt.show()
Correlation heatmap
# Plots to analyse tenure vs satisfaction; tenure vs left distribution
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (20,8))
# Tenure vs left distribution
tenure_stay = df1[df1['left']==0]['tenure']
tenure_left = df1[df1['left']==1]['tenure']
sns.histplot(data=df1, x='tenure', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('Tenure distribution classified by employee who left', fontsize=14)
# Tenure vs Satisfaction
sns.boxplot(data=df1, x='satisfaction', y='tenure', hue='left', orient="h", saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Satisfaction vs Tenure', fontsize=14)
plt.show()
Box Plot
Histogram Plot
The histogram distribution shows that only a few people stay more than 5 years, which might be due to promotions to higher ranks in the company.
# plot for #_project vs avg_mon_hrs; distribution of #_projects
fig, ax = plt.subplots(1, 2, figsize = (20,8))
# distribution of #_projects
projects_stay = df1[df1['left']==0]['#_projects']
projects_left = df1[df1['left']==1]['#_projects']
sns.histplot(data=df1, x='#_projects', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('No of projects distribution classified by employee who left', fontsize=14)
# #_project vs avg_mon_hrs
sns.boxplot(data=df1, x='avg_mon_hrs', y='#_projects', hue='left', orient="h",saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Average monthly hours by No of project', fontsize=14)
plt.show()
Based on the plots,
Histogram
Box Plots
Employees who left the company,
# Plots for satisfaction vs salary; satisfaction vs last_eval;
fig, ax = plt.subplots(1, 2, figsize = (20,8))
# plot for satisfaction vs salary
sns.boxplot(data=df1, x='satisfaction', y='salary', hue='left',
orient="h", saturation=0.75, ax=ax[0])
ax[0].invert_yaxis()
ax[0].legend(loc='upper left', title='Left')
ax[0].set_title('Satisfaction vs Salary', fontsize=14)
# Plot for satisfaction vs avg_mon_hrs
sns.scatterplot(data=df1, x='satisfaction', y='avg_mon_hrs', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Satisfaction level by average monthly work hours', fontsize=14)
Based on the plots,
Box plot
Salary is strongly related to the satisfaction level: at the low and medium salary levels there are very low satisfaction scores and a high number of employees who left the company.
Scatter plot
Employees who worked very long hours in the company show very low satisfaction levels, while a cluster with satisfaction below 0.5 corresponds to employees who worked fewer hours, possibly because they were fired or had already given notice to leave the company. This is consistent with the previous box plots.
# Plot for avg_mon_hrs vs last_eval; avg_mon_hrs vs promotion_<5yrs
fig, ax = plt.subplots(1, 2, figsize = (20,8))
# Plot for avg_mon_hrs vs promotion_<5yrs
sns.scatterplot(data=df1, x='avg_mon_hrs', y='promotion_<5yrs', hue='left', ax=ax[0])
ax[0].set_title('Average monthly hours by promotion in the last 5 years', fontsize=14)
# Plot for avg_mon_hrs vs last_eval
sns.scatterplot(data=df1, x='avg_mon_hrs', y='last_eval', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Average monthly hours by evaluation score', fontsize=14)
Based on the plot,
Avg_mon_hrs vs Promotion_<5yrs
avg_mon_hrs vs last_eval: Employees who left,
# Plot for distribution of employee who left by department
plt.figure(figsize=(13,10))
sns.histplot(data=df1, x='department', hue='left', discrete=1,
hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.title('Employees distribution classified by department', fontsize=14)
plt.show()
Sales, Technical, and Support are the top three departments from which employees left, compared to the other departments.
The key drivers behind employees leaving are as follows.
Most of the employees who left were burned out from working long hours on a large number of projects without receiving benefits such as promotion or a higher salary. Dissatisfaction is prevalent among overworked staff, especially when rewards such as promotions or salary increments are absent. These findings highlight potential issues in company management and HR policies that may have to be investigated further, with strategic action taken to improve employee satisfaction and retention.
To assess the likelihood of employee attrition, logistic
regression and tree-based models - decision tree and random forest -
were employed. The categorical variables are first encoded:
salary was mapped ordinally from low to high, and
department was mapped using dummy variables to retain
information for all the categories. Outliers in tenure
were also removed for the logistic regression to ensure better model
stability and performance.
Logistic regression was selected for its interpretability and ability to model linear relationships, while decision tree models were implemented to capture potential non-linear patterns and interactions between variables. These two models offer complementary perspectives on the dataset, allowing for a more robust evaluation of predictive power and feature importance.
# Encoding the categorical into numerical
# Copy the dataframe for the modelling
enc_df = df1.copy()
# Mapping the salary category with ordinal numbers according to hierarchy
salary_map = {'low':0, 'medium':1, 'high':2}
# Creating a new column for the salary map
enc_df['salary'] = enc_df['salary'].map(salary_map)
# Encoding the `department` with dummy variables
enc_df = pd.get_dummies(enc_df, drop_first=False)
enc_df.head().style.set_table_attributes("class='table table-sm'")
| | satisfaction | last_eval | #_projects | avg_mon_hrs | tenure | work_accident | left | promotion_<5yrs | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.380000 | 0.530000 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.800000 | 0.860000 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.110000 | 0.880000 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.720000 | 0.870000 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.370000 | 0.520000 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
# Removing the outliers in the tenure and saving it in a new dataframe
df_lr = enc_df[(enc_df['tenure'] >= lower_limit) & (enc_df['tenure'] <= upper_limit)]
df_lr.head().reset_index(drop=True).style.set_table_attributes("class='table table-sm'")
| | satisfaction | last_eval | #_projects | avg_mon_hrs | tenure | work_accident | left | promotion_<5yrs | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.380000 | 0.530000 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.110000 | 0.880000 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.720000 | 0.870000 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.370000 | 0.520000 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.410000 | 0.500000 | 2 | 153 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
list(df_lr.shape)
## [11167, 19]
Logistic regression is a supervised machine learning algorithm used for classification problems, especially when the target variable is binary. Unlike linear regression, which is used to predict continuous outcomes, logistic regression estimates the probability that a given input belongs to a particular category. It uses the logistic (sigmoid) function to map predicted values to a range between 0 and 1, making it ideal for predicting boolean outcomes.
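The sigmoid mapping described above can be illustrated with a minimal standalone sketch (hypothetical input values, not part of the project pipeline):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score of 0 maps to a probability of exactly 0.5;
# large positive scores approach 1 and large negative scores approach 0.
print(sigmoid(0.0))          # 0.5
print(sigmoid(4.0) > 0.95)   # True
print(sigmoid(-4.0) < 0.05)  # True
```

In logistic regression, z is the linear combination of the features and the fitted coefficients, so the model's output can be read directly as the estimated probability of the positive class (here, an employee leaving).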
Starting the logistic regression by setting the target and predictor variables. Then training the model with the training dataset and then using the test dataset to test the model.
# Setting the 'y' variable
y = df_lr['left']
# Setting the 'x' variable with dropping the left column
X = df_lr.drop('left', axis=1)
# Split the data into training (75%) and test (25%) dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)
# Constructing the LogReg model
log_clf = LogisticRegression(random_state=0, max_iter=500)
# Fitting the model
log_clf.fit(X_train,y_train)
LogisticRegression(max_iter=500, random_state=0)
# Use the model for the test dataset
y_pred = log_clf.predict(X_test)
# Constructing a confusion matrix
# Computing values in the matrix
log_cm = confusion_matrix(y_test, y_pred, labels=log_clf.classes_)
# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix = log_cm,
display_labels = log_clf.classes_)
# Plot confusion matrix
log_disp.plot(values_format='')
# Display plot
plt.show()
Model accurately predicts,
Checking the class imbalance
df_lr['left'].value_counts(normalize=True)
## left
## 0 0.831468
## 1 0.168532
## Name: proportion, dtype: float64
The data shows an 83% to 17% split: the class distribution is imbalanced, with only 17% of the data representing employees who left.
# Create classification report for logistic regression model
row_names = ['Predicted would not leave', 'Predicted would leave']
# Generate report as dict
report_logr = classification_report(y_test, y_pred, target_names = row_names, output_dict=True)
# Choose averaging strategy
# Extract metrics
avg_type = 'weighted avg'
precision = report_logr[avg_type]['precision']
recall = report_logr[avg_type]['recall']
f1 = report_logr[avg_type]['f1-score']
accuracy = accuracy_score(y_test, y_pred)
# Needed for AUC
y_proba = log_clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
# Create the row
summary_row = {
'model': 'Logistic Regression',
'precision': round(precision, 6),
'recall': round(recall, 6),
'F1': round(f1, 6),
'accuracy': round(accuracy, 6),
'auc': round(auc, 6)
}
# Convert to DataFrame
report_logr = pd.DataFrame([summary_row])
report_logr.style.set_table_attributes("class='table table-sm'")
| | model | precision | recall | F1 | accuracy | auc |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.793086 | 0.825573 | 0.801372 | 0.825573 | 0.892291 |
The classification report shows that the model scores poorly on the key objective of this project: correctly predicting the employees who will leave.
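One common mitigation for such imbalance, not applied in this project and shown here only as a hedged sketch on synthetic data, is to re-weight the classes so that errors on the minority class cost more, for example via scikit-learn's class_weight='balanced' option:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic data with roughly the same 83/17 class split as the HR dataset
X, y = make_classification(n_samples=4000, weights=[0.83, 0.17], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=500, class_weight='balanced').fit(X_tr, y_tr)

# Re-weighting typically trades some precision for higher recall
# on the minority class (here, the employees who leave)
print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, weighted.predict(X_te)))
```

On data like these, the balanced model usually recovers substantially more of the minority class at the cost of more false positives; whether that trade-off is acceptable depends on how the HR team weighs missed leavers against unnecessary interventions.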
Tree-based models, such as Decision Trees and Random Forests, are powerful and intuitive methods used for both classification and regression tasks.
A Decision Tree splits the data into subsets based on the value of input features, creating a tree-like structure where each internal node represents a decision rule. It is easy to interpret but can be prone to overfitting.
To overcome this, Random Forest, an ensemble technique, builds multiple Decision Trees on different random subsets of the data and aggregates their predictions to improve accuracy and generalization. Random Forests reduce variance and handle high-dimensional data well, making them robust for complex modeling tasks.
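As a small, hypothetical sketch of this aggregation (on synthetic data, not the HR dataset): in scikit-learn, a random forest's predicted class probabilities are the average of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# The forest's probability estimate equals the mean over its trees'
# estimates; the predicted class is the one with the higher probability.
tree_mean = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
print(np.allclose(rf.predict_proba(X), tree_mean))  # True
```

Because each tree sees a different bootstrap sample, the trees make partially independent errors, and averaging them reduces the variance that makes a single decision tree overfit.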
Starting the tree based models by setting the target and predictor variables. Then training the model with the training dataset and then using the test dataset to test the model.
# Using the enc_df dataframe
# Setting the y variable
y = enc_df['left']
# Setting the X variable
X = enc_df.drop('left',axis=1)
# Split the data into training (75%) and test (25%) dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)
# Instantiate the decision tree model
tree = DecisionTreeClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[2, 4, 6, None],
'min_samples_leaf': [2, 6, 3],
'min_samples_split': [2, 5,7]
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
dtree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
# Fitting the model
dtree1.fit(X_train,y_train)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
param_grid={'max_depth': [2, 4, 6, None],
'min_samples_leaf': [2, 6, 3],
'min_samples_split': [2, 5, 7]},
refit='roc_auc',
scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
DecisionTreeClassifier(max_depth=4, min_samples_leaf=6, random_state=0)
# Check best parameters
print(dtree1.best_params_)
## {'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 2}
# Check best AUC score on CV
print(dtree1.best_score_)
## 0.9698667651120891
def make_results(model_name:str, model_object, metric:str):
'''
Arguments:
model_name (string): what you want the model to be called in the output table
model_object: a fit GridSearchCV object
metric (string): precision, recall, f1, accuracy, or auc
Returns a pandas df with the F1, recall, precision, accuracy, and auc scores
for the model with the best mean 'metric' score across all validation folds.
'''
# Create dictionary that maps input metric to actual metric name in GridSearchCV
metric_dict = {'auc': 'mean_test_roc_auc',
'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1',
'accuracy': 'mean_test_accuracy'
}
# Get all the results from the CV and put them in a df
cv_results = pd.DataFrame(model_object.cv_results_)
# Isolate the row of the df with the max(metric) score
best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
# Extract Accuracy, precision, recall, and f1 score from that row
auc = best_estimator_results.mean_test_roc_auc
f1 = best_estimator_results.mean_test_f1
recall = best_estimator_results.mean_test_recall
precision = best_estimator_results.mean_test_precision
accuracy = best_estimator_results.mean_test_accuracy
# Create table of results
table = pd.DataFrame()
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy],
'auc': [auc]
})
return table
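The `make_results` helper relies on GridSearchCV naming its aggregated CV columns `mean_test_<scorer>`. A minimal stand-in (dummy numbers, not real CV results) illustrates the lookup it performs:

```python
import pandas as pd

# Two candidate models, scored on two metrics, using GridSearchCV's
# 'mean_test_<scorer>' column convention.
cv_results = pd.DataFrame({
    'mean_test_roc_auc': [0.95, 0.97],
    'mean_test_f1':      [0.90, 0.92],
})

# Pick the row with the best AUC, then read any other metric off that same row,
# exactly as make_results does.
best_row = cv_results.iloc[cv_results['mean_test_roc_auc'].idxmax()]
print(best_row['mean_test_f1'])  # F1 of the best-AUC candidate: 0.92
```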
# Get all CV scores
dtree1_cv_results = make_results('Decision Tree 1 CV', dtree1, 'auc')
dtree1_cv_results.reset_index(drop=True, inplace=True)
dtree1_cv_results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
The Decision Tree model demonstrated strong cross-validation performance, with high scores across all key metrics (precision 0.91, recall 0.92, F1 0.92, accuracy 0.97, AUC 0.97).
These results indicate that the model fits the data well. However,
decision tree models are prone to overfitting. To address this concern
and ensure better generalization, a Random Forest model was
trained for comparison.
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'n_estimators': [300, 500],
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fitting the model
rf1.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [3, 5, None], 'max_features': [1.0],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [300, 500]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
RandomForestClassifier(max_depth=5, max_features=1.0, max_samples=0.7,
                       min_samples_split=4, n_estimators=500, random_state=0)
# Check best params
rf1.best_params_
## {'max_depth': 5, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 500}
# Check best AUC score on CV
rf1.best_score_
## np.float64(0.9804250949807172)
# Get all CV scores
rf1_cv_results = make_results('Random Forest 1 CV', rf1, 'auc')
results = pd.concat([rf1_cv_results,dtree1_cv_results], axis=0)
results.reset_index(drop=True, inplace=True)
results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 |
| 1 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
The Random Forest model also demonstrated strong
performance in cross-validation, achieving high scores across all key
metrics (precision 0.95, recall 0.92, F1 0.93, accuracy 0.98, AUC 0.98).
These metrics indicate the model's strong ability to correctly classify both classes while maintaining good generalization.
Based on the cross-validation results,
the Random Forest model scores better than the Decision
Tree, achieving higher precision, F1 score, accuracy, and AUC. Random Forest also helps reduce overfitting, a
known issue with standalone decision trees, improving predictive
reliability. Since the Random Forest performs better than the Decision
Tree, it is evaluated on the test set.
def get_scores(model_name:str, model, X_test_data, y_test_data):
'''
Generate a table of test scores.
In:
model_name (string): How you want your model to be named in the output table
model: A fit GridSearchCV object
X_test_data: numpy array of X_test data
y_test_data: numpy array of y_test data
Out: pandas df of precision, recall, f1, accuracy, and AUC scores for your model
'''
preds = model.best_estimator_.predict(X_test_data)
auc = roc_auc_score(y_test_data, preds)
accuracy = accuracy_score(y_test_data, preds)
precision = precision_score(y_test_data, preds)
recall = recall_score(y_test_data, preds)
f1 = f1_score(y_test_data, preds)
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'f1': [f1],
'accuracy': [accuracy],
'AUC': [auc]
})
return table
# Get predictions on test data
rf1_test_scores = get_scores('Random Forest 1 Test', rf1, X_test, y_test)
rf1_test_scores.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 Test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
The Random Forest model also demonstrated strong
performance on the test set (precision 0.96, recall 0.92, F1 0.94, accuracy 0.98, AUC 0.96).
Test results are similar to, and on some metrics slightly higher than, the cross-validation results, which suggests the model generalizes well. The precision, recall, and F1 scores show an effective balance between false positives and false negatives, while the accuracy and AUC indicate excellent discrimination between the classes. Because the test data was used only for this final evaluation, similar performance can be expected on new, unseen data.
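One caveat worth noting: `get_scores` computes `roc_auc_score` from the hard 0/1 labels returned by `predict()`. Passing the probabilities from `predict_proba()` instead retains the ranking information the ROC curve measures and generally yields a different (often higher) AUC. A toy sketch of the difference, not a change to the pipeline above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
proba  = np.array([0.2, 0.6, 0.7, 0.9])   # toy predicted P(left)
hard   = (proba >= 0.5).astype(int)       # what predict() would return

auc_proba = roc_auc_score(y_true, proba)  # uses the full ranking -> 1.0
auc_hard  = roc_auc_score(y_true, hard)   # ranking collapsed to 0/1 -> 0.75
print(auc_proba, auc_hard)
```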
The Round 1 models included all the variables as features for prediction. For the Round 2 models, feature engineering is used to customize the data in an attempt to improve the models.
Relevant variables were selected and transformed to improve model performance, including encoding categorical features, handling missing values, and creating meaningful derived features. The following can be engineered in this dataset:
# Drop the `satisfaction` column and save the resulting dataframe in a new variable
df2 = enc_df.drop('satisfaction', axis=1)
# Display first few rows of new dataframe
df2.head().style.set_table_attributes("class='table table-sm'")
| last_eval | #_projects | avg_mon_hrs | tenure | work_accident | left | promotion_<5yrs | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.530000 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.860000 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.880000 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.870000 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.520000 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
# Create `overworked` column. For now, it's identical to average monthly hours.
df2['overworked'] = df2['avg_mon_hrs']
# Inspect max and min average monthly hours values
print('Max hours:', df2['overworked'].max())
## Max hours: 310
print('Min hours:', df2['overworked'].min())
## Min hours: 96
# Define `overworked` as working > 175 hrs/month
df2['overworked'] = (df2['overworked'] > 175).astype(int)
# Display first few rows of new column
df2[['overworked']].head().style.set_table_attributes("class='table table-sm'")
| overworked | |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
Assuming a 40-hour work week with a two-week vacation policy, average working hours per month = 40 hours × 50 weeks / 12 months ≈ 166.67 hours. Overworked can then be defined as averaging more than 175 working hours per month. Therefore, employees working more than 175 hours/month were classified as overworked (1), while others were labeled as not overworked (0).
To enrich the dataset with meaningful predictors, a new binary feature overworked was engineered based on average monthly working hours. This engineered feature adds interpretability to the model and allows it to capture the potential impact of excessive working hours on employee behavior or outcomes.
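The threshold arithmetic described above can be sketched as follows (the constant names and the helper function are illustrative, not part of the notebook):

```python
# Assumes a 40-hour week and 50 working weeks per year (two weeks of vacation).
HOURS_PER_WEEK = 40
WORKING_WEEKS_PER_YEAR = 50
MONTHS_PER_YEAR = 12

avg_monthly_hours = HOURS_PER_WEEK * WORKING_WEEKS_PER_YEAR / MONTHS_PER_YEAR
print(round(avg_monthly_hours, 2))  # 166.67

OVERWORKED_THRESHOLD = 175  # a small buffer above the nominal average

def is_overworked(monthly_hours: float) -> int:
    """Return 1 if average monthly hours exceed the threshold, else 0."""
    return int(monthly_hours > OVERWORKED_THRESHOLD)

print(is_overworked(310), is_overworked(96))  # 1 0
```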
# Drop the `average_monthly_hours` column
df2 = df2.drop('avg_mon_hrs', axis=1)
# Display first few rows of resulting dataframe
df2.head().style.set_table_attributes("class='table table-sm'")
| last_eval | #_projects | tenure | work_accident | left | promotion_<5yrs | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | overworked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.530000 | 2 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 0 |
| 1 | 0.860000 | 5 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False | 1 |
| 2 | 0.880000 | 7 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False | 1 |
| 3 | 0.870000 | 5 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 1 |
| 4 | 0.520000 | 2 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 0 |
# Isolate the outcome variable
y = df2['left']
# Select the features
X = df2.drop('left', axis=1)
# Create test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Instantiate model
tree = DecisionTreeClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
dtree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
dtree2.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
DecisionTreeClassifier(max_depth=6, min_samples_leaf=2, min_samples_split=6,
                       random_state=0)
# Check best params
dtree2.best_params_
## {'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
# Check best AUC score on CV
dtree2.best_score_
## np.float64(0.9586752505340426)
# Get all CV scores
dtree2_cv_results = make_results('Decision Tree 2 CV', dtree2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results], axis=0)
results.reset_index(drop=True, inplace=True)
results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
| 1 | Decision Tree 2 CV | 0.856693 | 0.903553 | 0.878882 | 0.958523 | 0.958675 |
| 2 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 |
The Decision Tree 2 model shows balanced performance across
all metrics but slightly underperforms both
Decision Tree 1 CV and Random Forest 1 CV
(F1 0.88 vs 0.92 and 0.93; AUC 0.96 vs 0.97 and 0.98).
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'n_estimators': [300, 500],
}
# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Instantiate GridSearch
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fitting the Model
rf2.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [3, 5, None], 'max_features': [1.0],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [300, 500]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
RandomForestClassifier(max_depth=5, max_features=1.0, max_samples=0.7,
                       min_samples_leaf=2, n_estimators=300, random_state=0)
# Check best params
rf2.best_params_
## {'max_depth': 5, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}
# Check best AUC score on CV
rf2.best_score_
## np.float64(0.9648100662833985)
# Get all CV scores
rf2_cv_results = make_results('Random Forest 2 CV', rf2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results,rf2_cv_results], axis=0)
results.reset_index(drop=True, inplace=True)
results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
| 1 | Decision Tree 2 CV | 0.856693 | 0.903553 | 0.878882 | 0.958523 | 0.958675 |
| 2 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 |
| 3 | Random Forest 2 CV | 0.866758 | 0.878754 | 0.872407 | 0.957411 | 0.964810 |
The Random Forest 2 CV model shows balanced performance across
all metrics; it slightly underperforms
Random Forest 1 CV (AUC 0.965 vs 0.980) while scoring slightly
better than Decision Tree 2 CV on AUC and precision.
Based on the cross-validation results for the two rounds of Decision Tree and Random Forest models,
with ROC AUC as the deciding metric, the Random Forest 1 CV model is the winning model, and
the test set can now be used for prediction. A confusion matrix is plotted to visualize the model's predictions on the test set.
# Generate array of values for confusion matrix
preds = rf2.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, preds, labels=rf2.classes_)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=rf2.classes_)
disp.plot(values_format='');
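As a reminder of how the four cells of a binary confusion matrix map to precision and recall, here is a small sketch on illustrative labels (not the report's actual test-set predictions):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only -- not the report's actual predictions.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# With labels sorted [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # of predicted leavers, how many actually left
recall = tp / (tp + fn)     # of actual leavers, how many were caught
print(tn, fp, fn, tp)       # 2 1 1 2
```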
# Get predictions on test data
rf2_test_scores = get_scores('Random Forest 2 Test', rf2, X_test, y_test)
test_results = pd.concat([rf1_test_scores, rf2_test_scores], axis=0)
test_results.reset_index(drop=True, inplace=True)
test_results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 Test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
| 1 | Random Forest 2 Test | 0.870406 | 0.903614 | 0.886700 | 0.961641 | 0.938407 |
A perfect model would yield all true negatives and true positives,
and no false negatives or false positives. The
Random Forest models demonstrated strong performance on the
test set. These results highlight the robustness and consistency of
Random Forest classifiers in predicting employee outcomes.
Comparing the two models,
Model 1 (the baseline Random Forest) outperforms
Model 2 (the feature-engineered Random Forest) on every test metric,
including recall (0.92 vs 0.90). The drop in precision and overall
performance suggests that the feature engineering may
have introduced noise or discarded informative signal (such as the dropped
satisfaction and average-monthly-hours columns) rather than enhancing
signal quality.
In this case, the feature engineering did not improve model performance; instead, it slightly degraded the classifier's ability to distinguish between the classes. This highlights the importance of validating each feature transformation step: not all feature engineering enhances model learning, and it can also obscure useful signals or add unnecessary dimensionality.
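The validation step described above can be sketched on synthetic data: compare cross-validated AUC with and without a candidate feature before committing to it. All data and names here are illustrative, not the report's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_signal = rng.normal(size=(200, 3))
y = (X_signal[:, 0] + X_signal[:, 1] > 0).astype(int)  # label driven by two features
X_noisy = np.hstack([X_signal, rng.normal(size=(200, 2))])  # append pure noise columns

rf = RandomForestClassifier(random_state=0)
auc_base  = cross_val_score(rf, X_signal, y, cv=4, scoring='roc_auc').mean()
auc_noisy = cross_val_score(rf, X_noisy, y, cv=4, scoring='roc_auc').mean()
print(round(auc_base, 3), round(auc_noisy, 3))
```

Comparing the two AUCs before adopting the extra columns is the same check that, applied to the Round 2 features here, would have flagged the performance drop early.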
report_logr.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.793086 | 0.825573 | 0.801372 | 0.825573 | 0.892291 |
The logistic regression model scores noticeably lower, particularly on the key objective of predicting which employees will leave. The class distribution is imbalanced, with an 83%/17% split and only 17% of the data representing employees who left. This imbalance may bias the model toward predicting that employees will stay.
Given these insights and limitations, it is worth exploring alternative classification models such as Decision Trees and Random Forests, which may handle non-linear relationships and imbalanced data more effectively and potentially improve prediction of employee attrition.
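One option worth noting when a model is biased by class imbalance is scikit-learn's `class_weight='balanced'` setting, which reweights samples inversely to class frequency. This is a hedged sketch on a toy 83/17 split, not the model trained in this report:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy 83%/17% split mirroring the stay/leave imbalance described above.
y = np.array([0] * 83 + [1] * 17)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class receives the larger weight

# The reweighting can be applied directly when instantiating the model.
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=0)
```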
# Get all CV scores
dt_results = pd.concat([dtree1_cv_results,dtree2_cv_results], axis=0)
dt_results.reset_index(drop=True, inplace=True)
dt_results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 |
| 1 | Decision Tree 2 CV | 0.856693 | 0.903553 | 0.878882 | 0.958523 | 0.958675 |
Decision Tree 1 (CV) outperformed
Decision Tree 2 (CV) across all evaluation metrics. It
achieved a higher precision, recall, and F1 score, indicating better
balance between false positives and false negatives. Additionally, it
showed superior accuracy and AUC, suggesting a more robust overall
classification performance.
# Plot the tree
plt.figure(figsize=(85,50))
plot_tree(dtree2.best_estimator_, max_depth=6, fontsize=14, feature_names=X.columns,
class_names={0:'stayed', 1:'left'}, filled=True);
plt.show()
# Extract rules as plain text
tree_rules = export_text(dtree2.best_estimator_, feature_names=list(X.columns), max_depth=6)
print(tree_rules)
## |--- #_projects <= 2.50
## | |--- last_eval <= 0.57
## | | |--- overworked <= 0.50
## | | | |--- last_eval <= 0.44
## | | | | |--- class: 0
## | | | |--- last_eval > 0.44
## | | | | |--- tenure <= 2.50
## | | | | | |--- class: 0
## | | | | |--- tenure > 2.50
## | | | | | |--- tenure <= 3.50
## | | | | | | |--- class: 1
## | | | | | |--- tenure > 3.50
## | | | | | | |--- class: 0
## | | |--- overworked > 0.50
## | | | |--- department_sales <= 0.50
## | | | | |--- department_IT <= 0.50
## | | | | | |--- class: 0
## | | | | |--- department_IT > 0.50
## | | | | | |--- tenure <= 3.00
## | | | | | | |--- class: 0
## | | | | | |--- tenure > 3.00
## | | | | | | |--- class: 0
## | | | |--- department_sales > 0.50
## | | | | |--- last_eval <= 0.47
## | | | | | |--- class: 0
## | | | | |--- last_eval > 0.47
## | | | | | |--- last_eval <= 0.48
## | | | | | | |--- class: 1
## | | | | | |--- last_eval > 0.48
## | | | | | | |--- class: 0
## | |--- last_eval > 0.57
## | | |--- last_eval <= 1.00
## | | | |--- last_eval <= 0.75
## | | | | |--- department_technical <= 0.50
## | | | | | |--- department_marketing <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- department_marketing > 0.50
## | | | | | | |--- class: 0
## | | | | |--- department_technical > 0.50
## | | | | | |--- last_eval <= 0.59
## | | | | | | |--- class: 0
## | | | | | |--- last_eval > 0.59
## | | | | | | |--- class: 0
## | | | |--- last_eval > 0.75
## | | | | |--- work_accident <= 0.50
## | | | | | |--- salary <= 1.50
## | | | | | | |--- class: 0
## | | | | | |--- salary > 1.50
## | | | | | | |--- class: 0
## | | | | |--- work_accident > 0.50
## | | | | | |--- class: 0
## | | |--- last_eval > 1.00
## | | | |--- salary <= 0.50
## | | | | |--- class: 1
## | | | |--- salary > 0.50
## | | | | |--- class: 0
## |--- #_projects > 2.50
## | |--- tenure <= 3.50
## | | |--- #_projects <= 5.50
## | | | |--- work_accident <= 0.50
## | | | | |--- salary <= 1.50
## | | | | | |--- last_eval <= 0.95
## | | | | | | |--- class: 0
## | | | | | |--- last_eval > 0.95
## | | | | | | |--- class: 0
## | | | | |--- salary > 1.50
## | | | | | |--- class: 0
## | | | |--- work_accident > 0.50
## | | | | |--- department_sales <= 0.50
## | | | | | |--- class: 0
## | | | | |--- department_sales > 0.50
## | | | | | |--- last_eval <= 0.84
## | | | | | | |--- class: 0
## | | | | | |--- last_eval > 0.84
## | | | | | | |--- class: 0
## | | |--- #_projects > 5.50
## | | | |--- department_support <= 0.50
## | | | | |--- last_eval <= 0.89
## | | | | | |--- department_technical <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- department_technical > 0.50
## | | | | | | |--- class: 0
## | | | | |--- last_eval > 0.89
## | | | | | |--- department_sales <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- department_sales > 0.50
## | | | | | | |--- class: 0
## | | | |--- department_support > 0.50
## | | | | |--- overworked <= 0.50
## | | | | | |--- last_eval <= 0.73
## | | | | | | |--- class: 0
## | | | | | |--- last_eval > 0.73
## | | | | | | |--- class: 1
## | | | | |--- overworked > 0.50
## | | | | | |--- last_eval <= 0.64
## | | | | | | |--- class: 0
## | | | | | |--- last_eval > 0.64
## | | | | | | |--- class: 0
## | |--- tenure > 3.50
## | | |--- last_eval <= 0.76
## | | | |--- #_projects <= 6.50
## | | | | |--- department_technical <= 0.50
## | | | | | |--- #_projects <= 5.50
## | | | | | | |--- class: 0
## | | | | | |--- #_projects > 5.50
## | | | | | | |--- class: 0
## | | | | |--- department_technical > 0.50
## | | | | | |--- #_projects <= 3.50
## | | | | | | |--- class: 0
## | | | | | |--- #_projects > 3.50
## | | | | | | |--- class: 0
## | | | |--- #_projects > 6.50
## | | | | |--- class: 1
## | | |--- last_eval > 0.76
## | | | |--- #_projects <= 4.50
## | | | | |--- tenure <= 4.50
## | | | | | |--- last_eval <= 0.99
## | | | | | | |--- class: 0
## | | | | | |--- last_eval > 0.99
## | | | | | | |--- class: 0
## | | | | |--- tenure > 4.50
## | | | | | |--- #_projects <= 3.50
## | | | | | | |--- class: 0
## | | | | | |--- #_projects > 3.50
## | | | | | | |--- class: 1
## | | | |--- #_projects > 4.50
## | | | | |--- overworked <= 0.50
## | | | | | |--- department_support <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- department_support > 0.50
## | | | | | | |--- class: 0
## | | | | |--- overworked > 0.50
## | | | | | |--- #_projects <= 5.50
## | | | | | | |--- class: 1
## | | | | | |--- #_projects > 5.50
## | | | | | | |--- class: 1
The rules above give a simplified, readable form of the main logic in the tree.
# Feature importances
dtree2_importances = pd.DataFrame(dtree2.best_estimator_.feature_importances_,
columns=['gini_importance'],
index=X.columns
)
dtree2_importances = dtree2_importances.sort_values(by='gini_importance', ascending=False)
# Only extract the features with importances > 0
dtree2_importances = dtree2_importances[dtree2_importances['gini_importance'] != 0]
dtree2_importances.style.set_table_attributes("class='table table-sm'")
| gini_importance | |
|---|---|
| last_eval | 0.343958 |
| #_projects | 0.343385 |
| tenure | 0.215681 |
| overworked | 0.093498 |
| department_support | 0.001142 |
| salary | 0.000910 |
| department_sales | 0.000607 |
| department_technical | 0.000418 |
| work_accident | 0.000183 |
| department_IT | 0.000139 |
| department_marketing | 0.000078 |
sns.barplot(data=dtree2_importances, x="gini_importance", y=dtree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importances for Employee Leaving", fontsize=14)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()
The feature importance plot for the decision tree model shows that
last_eval, #_projects,
tenure, and overworked are the most important
features, in that order, for predicting the outcome variable
'employee left'. In contrast, features such as department,
salary, and work accident contribute minimally
to the prediction. This suggests that performance evaluation, workload,
and time spent at the company are the key factors influencing employee
attrition.
# Get all rf CV scores
rf_results = pd.concat([rf1_cv_results,rf2_cv_results], axis=0)
rf_results.reset_index(drop=True, inplace=True)
rf_results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 |
| 1 | Random Forest 2 CV | 0.866758 | 0.878754 | 0.872407 | 0.957411 | 0.964810 |
# Get all rf test scores
rf_test_results = pd.concat([rf1_test_scores, rf2_test_scores], axis=0)
rf_test_results.reset_index(drop=True, inplace=True)
rf_test_results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | Random Forest 1 Test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
| 1 | Random Forest 2 Test | 0.870406 | 0.903614 | 0.886700 | 0.961641 | 0.938407 |
The Random Forest 1 model consistently outperformed
Random Forest 2 across both the cross-validation (CV) and test
sets. In cross-validation, Random Forest 1 achieved a
higher F1 score (0.93 vs 0.87) and accuracy (0.98 vs 0.96).
Similarly, on the test set, Random Forest 1 yielded a
better F1 score (0.94 vs 0.89) and accuracy (0.98 vs 0.96).
The higher AUC values across both CV and
test sets further confirm the superior classification performance of
Random Forest 1.
Now, plot the feature importance for the Random Forest 2
model.
# Get feature importances
feat_impt = rf2.best_estimator_.feature_importances_
# Get indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]
# Get column labels of top 10 features
feat = X.columns[ind]
# Filter `feat_impt` to consist of top 10 feature importance
feat_impt = feat_impt[ind]
y_df = pd.DataFrame({"Feature":feat,"Importance":feat_impt})
y_sort_df = y_df.sort_values("Importance")
fig = plt.figure()
ax = fig.add_subplot(111)
y_sort_df.plot(kind='barh', ax=ax, x="Feature", y="Importance")
ax.set_title("Random Forest 2: Important variables that have an impact in employees leaving", fontsize = 14)
ax.set_ylabel("Feature")
ax.set_xlabel("Importance")
plt.show()
The feature importance plot for the Random Forest model shows the same ranking of top features as the Decision Tree model's feature importance plot.
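As an aside, the top-feature selection done above with `np.argpartition` can also be written with pandas' `nlargest`, which some readers may find more direct. The importance values below are illustrative, rounded from the decision-tree table earlier, not the fitted Random Forest:

```python
import pandas as pd

# Illustrative importances indexed by feature name.
importances = pd.Series(
    [0.344, 0.343, 0.216, 0.093, 0.001],
    index=['last_eval', '#_projects', 'tenure', 'overworked', 'salary'],
)

# nlargest sorts descending and keeps the top n in one call.
top3 = importances.nlargest(3)
print(list(top3.index))  # ['last_eval', '#_projects', 'tenure']
```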
# Get all model scores
all_models = [report_logr, dtree1_cv_results, dtree2_cv_results, rf1_cv_results, rf2_cv_results, rf1_test_scores, rf2_test_scores]
model_results = pd.concat(all_models, axis=0)
model_results.reset_index(drop=True, inplace=True)
model_results.style.set_table_attributes("class='table table-sm'")
| model | precision | recall | F1 | accuracy | auc | f1 | AUC | |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.793086 | 0.825573 | 0.801372 | 0.825573 | 0.892291 | nan | nan |
| 1 | Decision Tree 1 CV | 0.914490 | 0.916279 | 0.915345 | 0.971867 | 0.969867 | nan | nan |
| 2 | Decision Tree 2 CV | 0.856693 | 0.903553 | 0.878882 | 0.958523 | 0.958675 | nan | nan |
| 3 | Random Forest 1 CV | 0.950023 | 0.915614 | 0.932467 | 0.977983 | 0.980425 | nan | nan |
| 4 | Random Forest 2 CV | 0.866758 | 0.878754 | 0.872407 | 0.957411 | 0.964810 | nan | nan |
| 5 | Random Forest 1 Test | 0.964211 | 0.919679 | nan | 0.980987 | nan | 0.941418 | 0.956439 |
| 6 | Random Forest 2 Test | 0.870406 | 0.903614 | nan | 0.961641 | nan | 0.886700 | 0.938407 |
From the initial assessment, EDA, and visualization, the employees appear to be overworked, likely due to poor workload management by the company. This is also confirmed by the models and their feature importances.
The following recommendations could be presented to the stakeholders for retaining employees:
Next Steps: Establish a structured method for collecting employee evaluation and satisfaction scores before an employee leaves the company, as the current collection process may introduce data leakage. Mitigating this issue would help improve the model's performance.